Making Sense of Summer Exercising Data

Over the summer of 2019, the deidentified subject started exercising more. Data was collected via a variety of devices and published to the social sharing site Strava.

The following is my attempt at cleaning, transforming and understanding this real-time, messy and complicated workout data set, sourced from multiple wearable exercise devices. Let's dive in.

Part 0. Import Data and Dependent Libraries

The following libraries will be used in this notebook:

Import our data set strava.csv.

Note: my Jupyter notebook sits in the same folder as the source data file strava.csv. Feel free to update the file path below to match wherever you stored the file.
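A minimal sketch of the load step, assuming the file sits next to the notebook. The helper name `load_strava` and the three sample columns are illustrative stand-ins rather than the notebook's actual code, and the in-memory sample exists only so the snippet runs without the real file.

```python
import io

import pandas as pd


def load_strava(source="strava.csv"):
    """Read the raw export; `source` may be a file path or an open file-like object."""
    return pd.read_csv(source)


# Made-up two-row stand-in for strava.csv (hypothetical columns and values)
sample = io.StringIO(
    "timestamp,heart_rate,enhanced_speed\n"
    "2019-07-08 21:04:03,68,0.00\n"
    "2019-07-08 21:04:04,68,1.32\n"
)
df = load_strava(sample)
print(df.shape)
```

With the real file, calling `load_strava()` with no argument, or with the path to wherever the file lives, would be enough.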

Part I. Data Cleaning

Step 1: Let's start with making sure that all time fields and position fields in the dataframe are converted to the appropriate format
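One way this conversion could look, as a sketch under two assumptions: that the position columns are named `position_lat` and `position_long`, and that they are stored in the "semicircle" units used by FIT files from many wearables. If the export already holds decimal degrees, the scaling step should be skipped. The sample values are made up.

```python
import pandas as pd

# FIT files store coordinates as signed 32-bit "semicircles": 2**31 semicircles = 180 degrees
SEMICIRCLE_TO_DEG = 180 / 2**31


def convert_fields(df):
    """Return a copy with a datetime timestamp and lat/long in decimal degrees."""
    out = df.copy()
    out["timestamp"] = pd.to_datetime(out["timestamp"])
    for col in ("position_lat", "position_long"):  # assumed column names
        out[col] = out[col] * SEMICIRCLE_TO_DEG
    return out


# Made-up sample rows for illustration
raw = pd.DataFrame({
    "timestamp": ["2019-07-08 21:04:03", "2019-07-08 21:04:04"],
    "position_lat": [504432050.0, 504432492.0],
    "position_long": [-999063637.0, -999064534.0],
})
clean = convert_fields(raw)
print(clean.dtypes)
```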

The dataframe contains both the data file source and a timestamp, but neither is an ideal key for splitting up the dataset: there can be multiple exercises in a day, and a single exercise can span multiple data file sources. The better approach is to identify unique exercises.

Step 2: Identify and segment unique exercises within the dataframe using time difference between timestamps
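A sketch of the segmentation idea: sort by time, flag every row whose gap from the previous row exceeds a cutoff, and take a running count of those flags. The 30-minute cutoff is the one referred to later in this notebook; the function name `label_exercises` and the column name `exercise_id` are assumptions made for illustration.

```python
import pandas as pd


def label_exercises(df, gap=pd.Timedelta(minutes=30)):
    """Assign an increasing exercise_id, starting a new one after any gap longer than `gap`."""
    out = df.sort_values("timestamp").reset_index(drop=True)
    # The first row's diff is NaT, which compares as False, so it stays in exercise 1
    new_session = out["timestamp"].diff() > gap
    out["exercise_id"] = new_session.cumsum() + 1
    return out


# Made-up timestamps: a 20-minute pause (same exercise), then a 40-minute gap (new exercise)
toy = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2019-07-08 10:00:00",
        "2019-07-08 10:00:01",
        "2019-07-08 10:20:00",
        "2019-07-08 11:00:00",
        "2019-07-08 11:00:01",
    ])
})
labeled = label_exercises(toy)
print(labeled)
```

Because the ID depends only on time gaps, an exercise that spans several data files keeps a single ID, and two exercises on the same day get different ones.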

Step 3: Parse out the date from the timestamp

Step 4: Create a summary dataframe with basic aggregated values grouped by each unique exercise
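A sketch of the aggregation using pandas named aggregation. Which metrics the original summary holds is not shown here, so the output columns below (`start_time`, `total_distance`, `avg_heart_rate`, and so on) are assumptions, as are the made-up sample rows.

```python
import pandas as pd

# Made-up row-level records for two exercises
strava = pd.DataFrame({
    "exercise_id": [1, 1, 1, 2, 2],
    "timestamp": pd.to_datetime([
        "2019-07-08 10:00:00", "2019-07-08 10:00:01", "2019-07-08 10:00:02",
        "2019-07-09 18:00:00", "2019-07-09 18:00:01",
    ]),
    "distance": [0.0, 3.0, 6.0, 0.0, 8.0],          # cumulative metres
    "heart_rate": [100, 110, 120, 90, 100],
    "enhanced_speed": [3.0, 3.0, 3.0, 8.0, 8.0],    # m/s
})

# One row per exercise, with named output columns
summary_df = (
    strava.groupby("exercise_id")
    .agg(
        start_time=("timestamp", "min"),
        end_time=("timestamp", "max"),
        total_distance=("distance", "max"),
        avg_heart_rate=("heart_rate", "mean"),
        avg_speed=("enhanced_speed", "mean"),
    )
    .reset_index()
)
print(summary_df)
```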

Step 5: Use the aggregated values at exercise level to calculate a few more metrics and classify each exercise
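A sketch of how duration and an exercise type could be derived from the summary. The notebook classifies exercises by average enhanced speed, but the cutoff actually used is not stated here, so the 5 m/s default below is a hypothetical placeholder, as are the function and column names.

```python
import numpy as np
import pandas as pd


def classify(summary, speed_cutoff=5.0):
    """Add duration in minutes and a Run/Cycling label based on average speed (m/s)."""
    out = summary.copy()
    out["duration_min"] = (out["end_time"] - out["start_time"]).dt.total_seconds() / 60
    # Hypothetical rule: anything faster than the cutoff on average is treated as cycling
    out["exercise_type"] = np.where(out["avg_speed"] > speed_cutoff, "Cycling", "Run")
    return out


# Made-up exercise-level rows
summary_df = pd.DataFrame({
    "exercise_id": [1, 2],
    "start_time": pd.to_datetime(["2019-07-08 10:00:00", "2019-07-09 18:00:00"]),
    "end_time": pd.to_datetime(["2019-07-08 11:01:00", "2019-07-09 18:45:00"]),
    "avg_speed": [3.0, 8.0],
})
classified = classify(summary_df)
print(classified[["exercise_id", "duration_min", "exercise_type"]])
```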

Step 6: Create a unique name for each exercise
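A sketch of one possible naming scheme, modeled on labels such as "7/17 ex 7 Run 61 minutes" that appear later in this notebook. The helper `exercise_name` and the input column names are assumptions, and the sample row is made up.

```python
import pandas as pd


def exercise_name(row):
    """Build a label like '7/17 ex 7 Run 61 minutes' from one summary row."""
    start = row["start_time"]
    minutes = int(round(row["duration_min"]))
    return f"{start.month}/{start.day} ex {row['exercise_id']} {row['exercise_type']} {minutes} minutes"


# Made-up exercise-level row
summary_df = pd.DataFrame({
    "exercise_id": [7],
    "start_time": pd.to_datetime(["2019-07-17 00:05:00"]),
    "exercise_type": ["Run"],
    "duration_min": [61.2],
})
summary_df["exercise_name"] = summary_df.apply(exercise_name, axis=1)
print(summary_df["exercise_name"].tolist())
```

Including the exercise ID in the label keeps names unique even when two exercises share a date, type and rounded duration.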

Step 7: Enhance the strava df with aggregated metrics from the summary_df
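A sketch of the enrichment step as a left merge on the exercise ID, so that every row-level record picks up its exercise-level attributes. The key name `exercise_id` and the attached columns are assumptions, and the sample rows are made up.

```python
import pandas as pd

# Made-up row-level and exercise-level frames
strava = pd.DataFrame({
    "exercise_id": [1, 1, 2],
    "heart_rate": [100, 110, 95],
})
summary_df = pd.DataFrame({
    "exercise_id": [1, 2],
    "exercise_type": ["Run", "Cycling"],
    "avg_heart_rate": [105.0, 95.0],
})

# validate="many_to_one" raises if summary_df ever has duplicate exercise IDs
enhanced = strava.merge(summary_df, on="exercise_id", how="left", validate="many_to_one")
print(enhanced)
```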

Part II. Data Visualization and Findings

A. Metrics Correlation Analysis

Is there any correlation between the health and intensity indicator metrics in the Strava dataset, such as heart rate, speed, altitude, power, and form power?

There are lots of physical metrics captured within the Strava dataframe. I am especially curious about the health and intensity indicator metrics, such as heart rate, speed, altitude, power, form power and air power.

Scatter plots are a good way to visualize clusters of data points while identifying relational and directional patterns between metrics. Scatterplot matrices (or sploms) build on the strength of a single scatter plot and provide a bird's-eye view of correlation patterns across multiple metrics at once. This makes a splom an excellent choice for this question.

The Seaborn library is well suited to sploms. Its pairplot function makes them simple to create, often requiring a single line of code: pass in a dataframe with numerical variables and get back a scatter plot matrix.

Insights:

B. Next steps: Narrowing in on Heart Rate & Enhanced Speed

In addition to the insights above, the splom analysis also shows that there is a good amount of heart rate and enhanced speed data available to compare across the running and cycling exercises. That sounds interesting. Let's dive into that a little further.

How does the subject's heart rate compare during running vs. biking?

To compare heart rate data between running and biking, we are going to use a boxplot. Boxplots are great at showing the range of the data, where the majority of data points concentrate, the median, and even outliers. Distribution is a big part of what we are after when looking at heart rate across exercise types, so the boxplot's ability to quickly call out these aspects of the data makes it a fitting choice.

Before we start plotting, let's do some quick checks on how many exercises we have in each category to understand our sample size:
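A sketch of the sample-size check: count distinct exercises, not rows, per category, since each exercise contributes many rows. The column names `exercise_id` and `exercise_type` are assumptions and the sample rows are made up.

```python
import pandas as pd

# Made-up row-level records: two runs and one cycling session
strava = pd.DataFrame({
    "exercise_id": [1, 1, 2, 2, 3],
    "exercise_type": ["Run", "Run", "Cycling", "Cycling", "Run"],
})

# nunique counts each exercise once, no matter how many rows it has
counts = strava.groupby("exercise_type")["exercise_id"].nunique()
print(counts)
```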

We have ample runs in our dataset, but the sample size for cycling is a little small (<25). Something to keep in the back of our mind as we continue the analysis.

On to boxplots, but first a little bit more data cleaning

We'll use the plotly.express module to make the plot. In my experience so far, this library is terse in code but robust in output. In some ways, it combines the convenience of matplotlib with the design of more client-friendly libraries like Altair. I also find the built-in tooltip and hover options really helpful: they can serve as dynamic call-outs without crowding the visualization.

Using the above boxplot, we can compare the range and distribution of the heart rate for running and biking.

Insights:

Note: we are at a disadvantage because of the small sample of biking exercises in the Strava dataset. The range, median and mean for biking heart rate may change with more samples.

What's Next? A Need for Speed

Now that we've looked at heart rate differences between the exercises, let's spend some time with another metric that also had a decent amount of data for both exercise groups - Enhanced Speed, in m/s - and see what insights we can gain from looking into its data.

You might recall that the per-exercise average of this metric was used earlier to classify each exercise as a running event or a biking event. In this next step, I'd like to plot a bird's-eye view of all exercise sessions, with line plots of speed over time. We should be able to tell easily from the speed distribution whether an exercise is a run or a cycling session.

It's quite a large plot, and I often get a browser warning about its significant memory consumption, but it's worth it. This was one of the first plots I made when I started cleaning the dataframe, and I found myself coming back to it over and over again as I gained more insight into how the data could be organized. I have managed to find new insights almost every time, both at a macro level (behavioral patterns across exercise types) and down to the granular patterns of breaks. Being able to see all the exercises visually really helps connect the dots. Some of the questions I wanted to explore are listed below:

Is speed maintained for longer sessions compared to shorter ones within the same exercise type?

Are there interesting behavioral or time patterns that jump out and are worth exploring further?

So let's get started. First, we are going to define the data pieces that we are interested in and create a more concise dataframe.

I am using the Plotly Express library again to produce the visualization, for its concise code and wide range of customization options.

To plot speed over time, I chose a line chart. Line charts are a good technique for looking at continuous trends, and here we are plotting speed over time per unique exercise. It is reasonable to assume that the activity recorded within an exercise is mostly continuous, with possible breaks of under 30 minutes. Recall that earlier we used 30 minutes as the cutoff to distinguish separate exercises from the subject taking a water break between reps of otherwise continuous exercise.

That's a monster chart - with lots of nuggets of information that we can use! Below are some of my insights for the questions listed.

Insights for Q1:

Insights for Q2:

These runs in particular piqued my interest. Let's see if we can go deeper in the next section.

C. Location, Location, Location

Where can one go running after midnight around A2 and other interesting geomapping questions

Selfishly, as an ex-A2 townie (don't worry, my phone area code will always be 734), I can't help but want to look deeper into the routes of the subject's runs and bike rides. Has he ever run by my old high school (previously a salamander swamp, Go Eagles!)? Does he visit my favorite trails along the Huron River? What about the Arb, where everyone and their mother took their graduation photos?

After spotting some of the behavioral trends from the speed analysis above, I have more reason to map out his exercises over the summer and dig deeper on follow-up questions listed below:

Where does the subject go to exercise? Are there general favorite routes/stops/locations the subject likes to visit?

Where does the subject go on these late night runs?

I will be using the folium library for this exercise. I find it the easiest to use of the mapping options I have tried across different libraries. The OpenStreetMap background is also excellent for identifying Ann Arbor landmarks and roads that bring back memories.

Map with all Routes Selected

Insights Q1 :

Map to Play With!

The map above shows all routes. It's useful for seeing where routes tend to concentrate, but not so good for drill-down analysis of an individual exercise.

The map below, however, is more interactive: it lets you choose which exercise(s) you would like to see via the layer button in the top right corner of the map. This is a sandbox for you to explore some more.

I used this map to dig a little deeper into the routes the subject took on the three overnight runs mentioned in the last section. My insights are listed below.

Insights Q2:

Midnight runs:

1) 7/17 ex 7 Run 61 minutes

    - Started at Mixwood St., down Miller, down Packard, then many loops around Ferry Field in front of the IM building

2) 7/24 ex 14 Run 64 minutes

    - Started at Mixwood St., out of the neighborhood, down North Main and through Bandemer Park (where I took my grad photos!), then down the boardwalk. Curiously, at Pontiac and Argo it seems the wearable failed to keep up and had to restart. I initially read this as a break in the Speed vs. Time line graph above. Looking at the route map, however, it's more likely that the device restarted, since the next latitude and longitude readings place the subject on his way home. Nice: I couldn't have completed the picture on this one without both of these visualizations.

3) 8/7 ex 27 Run 76 minutes

    - Almost exactly the 7/17 route: started at Mixwood St., down Miller, down Packard, then many loops around Ferry Field in front of the IM building. Only two data points so far, but I have an inkling that this could be a tried-and-true route. As we reach the end of this exercise, this would be a great topic to continue exploring in the future.

Feel free to check out other route patterns using this map!

Tip: Hover over each route for a tooltip that will help to quickly identify the exercise date, type and the exercise ID. Click on each route for a pop up that reveals more information for the selected exercise on start and end time, duration of exercise, total distance, average speed and average heart rate.

Dependencies